

# **Endpoints Proposal Update**

Jim Dinan, MPI Forum Hybrid Working Group, June 2014

## **Outline**

1. Big picture, performance perspective
2. New performance studies
   - Extended journal version of EuroMPI paper
     - Measure process/thread performance tradeoffs
   - Implementation paper submitted to SC
     - Results consistent with claims, will present in Japan
3. Endpoints interface review
4. Recent developments in the proposal
   - Primarily address gaps that arise when MPI processes share a virtual address space
   - Query function, communicator/group comparisons





# **Today: Threads/Processes Tradeoff**



Threads and processes are entangled; users must make a tradeoff

- Benefits of threads to node-level performance/resources
- Versus benefits of processes to communication throughput





## **Future: MPI Endpoints**



Enable threads to achieve process-like communication perf.

- Eliminate negative interference between threads
  - Both semantics (ordering) and mechanics (implementation issues)
- Enable threads to drive independent traffic injection/extraction points





## **Measure Process/Thread Tradeoffs**

"Enabling Communication Concurrency through Flexible MPI Endpoints." James Dinan, Ryan E. Grant, Pavan Balaji, David Goodell, Douglas Miller, Marc Snir, and Rajeev Thakur. Submitted to IJHPCA.

Look at communication performance trends in many-core systems

Identify/measure opportunities for endpoints to improve performance

#### System setup:

- Intel® Xeon Phi<sup>™</sup> 5110P "Knights Corner" coprocessors
  - 60 cores @ 1.053 GHz, 8 GB memory, 4-way hyperthreading
- Intel MPI Library v4.1 update 1, Intel C Compiler v13.1.2
  - Run in native mode: Phi cores do MPI processing
- Mellanox 4x QDR InfiniBand, max bandwidth 32 Gb/sec

Intended to represent future systems where the network is designed to support traffic from many cores





# Impact of Increasing Num. Processes





#### Measure communication performance between two nodes

- OSU benchmark: N senders and N receivers per node
- Performance increases with more processes (P) driving communication

#### Processes represent "ideal" endpoints

- Private communication state and communication resources
- Represent performance upper bound





# Threads/Proc. Tradeoff (Msg. Rate)



[Figure: message rate, 8 cores]





"I have N cores, how do I use them?"

- Threads address node level concerns (e.g. memory pressure)
- Processes provide better communication performance

Endpoints will enable threads to behave like processes

- Decouple threads in mechanics: private communication state
- Decouple threads in semantics: isolation in message ordering





## Threads/Proc. Tradeoff (BW)

[Figure: bandwidth, 16-core and 8-core panels]



Thread/process tradeoff impacts bandwidth

- Bandwidth saturates for 32 KiB+ messages
- More processes = better throughput for smaller messages



# **MPI Endpoints**

Relax the 1-to-1 mapping of ranks to threads/processes





## **MPI Endpoints Semantics**



MPI Process: Set of resources supporting execution of MPI communications

- MPI rank and execution resources to drive it when needed
- Endpoints have MPI process semantics (e.g. progress, matching, ...)
  - Collectives are called concurrently on all endpoints (MPI processes)

#### Improve programmability of MPI + Threads

- Allow threads to be MPI processes, addressable through MPI
- Make the number of VA spaces a free parameter for applications

Enable threads to act like processes / have process-like performance

Per-thread communication state/resources, process-like performance



## **MPI Endpoints API**



Creates new MPI ranks from existing ranks in parent comm.

Each process in parent comm. requests a number of endpoints

Output handles correspond to different ranks in the same comm.

Takes TLS out of the implementation and off the critical path

Can return MPI\_ERR\_ENDPOINTS if endpoints could not be created
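To make the rank creation concrete, here is a minimal sketch in plain C (no MPI calls) of how ranks in the output communicator could be laid out. It assumes endpoint ranks are assigned contiguously in parent-rank order; the slides do not state the ordering, so treat that as an assumption. The helper names `ep_rank_of` and `ep_comm_size` and the `counts` array are hypothetical and not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not part of MPI or the endpoints proposal.
 * counts[p] is the number of endpoints requested by parent rank p. */

/* Rank, in the endpoints communicator, of endpoint `idx` created by
 * parent rank `parent`. ASSUMPTION: ranks are assigned contiguously
 * in parent-rank order. */
int ep_rank_of(const int *counts, size_t nparents, size_t parent, int idx)
{
    assert(parent < nparents);
    assert(idx >= 0 && idx < counts[parent]);

    int first = 0;                       /* exclusive prefix sum */
    for (size_t p = 0; p < parent; p++)
        first += counts[p];
    return first + idx;
}

/* Size of the resulting endpoints communicator: the sum of all requests. */
int ep_comm_size(const int *counts, size_t nparents)
{
    int total = 0;
    for (size_t p = 0; p < nparents; p++)
        total += counts[p];
    return total;
}
```

With requests of 2, 1, and 3 endpoints from three parent processes, this model gives a communicator of size 6 in which the third parent's endpoints hold ranks 3 through 5.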





# **Example: MPI+OpenMP Hybrid with Endpoints**

    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int world_rank, tl;
        int max_threads = omp_get_max_threads();
        MPI_Comm ep_comm[max_threads];   /* one endpoint handle per thread */

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &tl);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        #pragma omp parallel
        {
            int nt = omp_get_num_threads();
            int tn = omp_get_thread_num();
            int ep_rank;

            /* Proposed call: the master thread creates nt endpoints */
            #pragma omp master
            {
                MPI_Comm_create_endpoints(MPI_COMM_WORLD, nt, MPI_INFO_NULL, ep_comm);
            }
            #pragma omp barrier

            /* Each thread drives its own endpoint */
            MPI_Comm_rank(ep_comm[tn], &ep_rank);
            ... // Do work based on 'ep_rank'
            MPI_Allreduce(..., ep_comm[tn]);

            MPI_Comm_free(&ep_comm[tn]);
        }
        MPI_Finalize();
    }



## **Recent Developments**

1. MPI\_Comm\_free() semantics
2. Query function
3. Communicator comparison
4. Group comparison



# MPI\_Comm\_free() with Endpoints

Past: Called in series on the endpoints in a hosting MPI\_COMM\_WORLD process

- Enable endpoints communicators to be freed when using a thread level < MPI\_THREAD\_MULTIPLE
- Changes MPI\_Comm\_free semantics, would have to always do a hierarchical free algorithm on endpoints even when using MPI\_THREAD\_MULTIPLE
- Breaks collective semantic that is expected by attribute callbacks

New approach: Leave this undefined for now, can define MPI\_Comm\_free\_endpoints in future

Would need to forbid collective attribute callbacks





# **Query Function: What to query?**

#### Find out if processes share an address space

- Libraries need to determine whether memory and other resources are private to an MPI rank
- Existing issue when MPI processes impl. as threads
- Also an issue with MPI endpoints
- Pursuing as a separate ticket (#425)

#### Query number of endpoints:

- Split with MPI\_COMM\_TYPE\_ENDPOINTS
- MPI\_Comm\_num\_endpoints(MPI\_Comm comm, int \*num\_ep, int \*ep\_id)
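As an illustration of what such a query might report, the sketch below recovers the hosting process, the number of co-hosted endpoints, and the local endpoint index from an endpoint's rank. It is plain C with no MPI calls. The interpretation of `num_ep` and `ep_id`, the contiguous rank layout, and the names `ep_info` and `ep_query` are all assumptions made for this example, not semantics taken from the proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the information a query could return. */
struct ep_info {
    int host;    /* parent rank hosting this endpoint */
    int num_ep;  /* number of endpoints hosted by that parent */
    int ep_id;   /* index of this endpoint within its host */
};

/* counts[p] is the number of endpoints created by parent rank p.
 * ASSUMPTION: endpoint ranks are contiguous in parent-rank order. */
struct ep_info ep_query(const int *counts, size_t nparents, int ep_rank)
{
    assert(ep_rank >= 0);

    int first = 0;  /* first endpoint rank owned by parent p */
    for (size_t p = 0; p < nparents; p++) {
        if (ep_rank < first + counts[p]) {
            struct ep_info info = { (int)p, counts[p], ep_rank - first };
            return info;
        }
        first += counts[p];
    }

    assert(!"ep_rank out of range");
    struct ep_info none = { -1, 0, -1 };
    return none;
}
```

A library could use this kind of information to decide, for example, how many ranks share its per-process resources.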

#### Query if any processes in comm. are in same VA space:

- MPI\_[Comm,Group,Win,File,...]\_has\_sharing(..., int \*flag)





# **Communicator Comparison**

When MPI processes share a VA space, it becomes possible for a process to see multiple handles to different ranks in the same communicator

Currently comm. comparison can result in: MPI\_IDENT, MPI\_CONGRUENT, MPI\_SIMILAR, MPI\_UNEQUAL

Past: Return MPI\_IDENT, then check ranks to see if they differ.

New proposal: Return MPI\_ALIASED to indicate same object, different ranks
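The distinction can be sketched with a toy model in plain C (no MPI calls). Two handles that refer to the same communicator object but act as different ranks compare as "aliased"; everything else follows the usual comparison rules, including MPI\_SIMILAR, which the MPI standard also defines for MPI\_Comm\_compare. The `comm_handle` struct, the `CMP_*` codes, and `comm_compare` are hypothetical names for illustration and do not reflect how an MPI library represents communicators.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes mirroring the MPI names discussed above. */
enum cmp_result { CMP_IDENT, CMP_ALIASED, CMP_CONGRUENT, CMP_SIMILAR, CMP_UNEQUAL };

/* Toy handle: `object` identifies the underlying communicator,
 * `rank` is the rank this particular handle acts as. */
struct comm_handle {
    int object;
    int rank;
    const int *members;  /* unique process ids, in rank order */
    size_t size;
};

static int contains(const int *set, size_t n, int id)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == id)
            return 1;
    return 0;
}

enum cmp_result comm_compare(const struct comm_handle *a,
                             const struct comm_handle *b)
{
    assert(a && b);

    /* Same object: identical handle, or an alias acting as another rank. */
    if (a->object == b->object)
        return a->rank == b->rank ? CMP_IDENT : CMP_ALIASED;

    /* Different objects: compare the groups. */
    if (a->size != b->size)
        return CMP_UNEQUAL;
    int same_order = 1;
    for (size_t i = 0; i < a->size; i++) {
        if (!contains(b->members, b->size, a->members[i]))
            return CMP_UNEQUAL;
        if (a->members[i] != b->members[i])
            same_order = 0;
    }
    return same_order ? CMP_CONGRUENT : CMP_SIMILAR;
}
```

A dedicated result code spares callers the second step of the past approach, where an MPI\_IDENT result still required a rank check.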



# **Group Comparison**

out\_comm\_handles[0] corresponds to the calling process's rank in the parent communicator

All other output ranks are new ranks that don't appear in any other group

Needed for group operations to make sense

- Consider my\_num\_ep == 1 at all processes
- Comparison with parent group should be CONGRUENT
- MPI\_Group\_translate\_ranks between output and parent communicators should yield same ranks
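The rule above can be checked with a small model in plain C (no MPI calls). A group is represented as an ordered array of process identities; the first endpoint of each parent keeps the parent's identity and every additional endpoint receives a fresh one. The function names, the `fresh` id counter, and the contiguous parent-order layout are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a group is an ordered array of process ids.
 * Builds the endpoints group from the parent group. Endpoint 0 of each
 * parent keeps the parent's id; the others get fresh ids starting at
 * `fresh` (assumed not to collide with any parent id).
 * Returns the number of entries written to `out`. */
size_t build_ep_group(const int *parent, const int *counts, size_t n,
                      int fresh, int *out, size_t out_cap)
{
    size_t k = 0;
    for (size_t p = 0; p < n; p++) {
        for (int i = 0; i < counts[p]; i++) {
            assert(k < out_cap);
            out[k++] = (i == 0) ? parent[p] : fresh++;
        }
    }
    return k;
}

/* Toy analogue of MPI_Group_translate_ranks for a single rank: returns
 * the rank in `to` of the process at `rank` in `from`, or -1 if it is
 * absent (MPI itself reports MPI_UNDEFINED). */
int translate_rank(const int *from, size_t nfrom, int rank,
                   const int *to, size_t nto)
{
    assert(rank >= 0 && (size_t)rank < nfrom);
    for (size_t i = 0; i < nto; i++)
        if (to[i] == from[rank])
            return (int)i;
    return -1;
}
```

When every process requests one endpoint, the built group equals the parent group and translation is the identity. With more endpoints, each parent rank still maps to its first endpoint, while the extra endpoints have no counterpart in the parent group.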





## **More Info**

## **Endpoints:**

https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/380

## Hybrid Working Group:

https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/MPI3Hybrid



## **Legal Disclaimer & Optimization Notice**

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804



